Fix QNN runner KV cache bitwidth detection in Android JNI #18731

Closed
abhinaykukkadapu wants to merge 1 commit into main from abhinayk/fix-qnn-kv-bitwidth-android-jni

Conversation

@abhinaykukkadapu
Contributor

Summary

  • The QNN runner in the Android JNI layer was hardcoded to Runner<uint16_t>, but models can use either 8-bit or 16-bit KV caches
  • This mismatch caused gibberish output in the Android demo app while the CLI runner worked correctly
  • Now dynamically queries get_kv_io_bit_width from the model (mirroring qnn_llama_runner.cpp) and instantiates the correct Runner<uint8_t> or Runner<uint16_t>
  • Also passes temperature_ to the Runner constructor (was previously omitted)

Fixes #18571
Closes #17622

Test plan

  • Built Android AAR with QNN support (SDK 2.37) — jni_layer_llama.cpp compiles cleanly with both template instantiations
  • Gradle unit tests pass (testDebugUnitTest)
  • On-device test with QNN model (in progress)

cc @cccclai @kirklandsign @infil00p

@pytorch-bot

pytorch-bot Bot commented Apr 7, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18731

Note: Links to docs will display an error until the docs builds have been completed.

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Apr 7, 2026
@github-actions

github-actions Bot commented Apr 7, 2026

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.


Development

Successfully merging this pull request may close these issues.

[QNN] Using QWEN3-1.7B model inference on Qualcomm 8550, the correct answer cannot be generated.
